System and method of analyzing audio data samples associated with speech recognition

ABSTRACT

A particular apparatus includes a first buffer that is configured to store multiple audio data samples and a second buffer that is configured to store the multiple audio data samples. The first buffer is coupled to a first processor that is configured to analyze audio data samples to detect a keyword. The second buffer is coupled to a second processor that is configured to initialize a speech recognition engine (SRE) based on the multiple audio data samples. The first buffer has less storage capacity than the second buffer.

I. CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority from and is a continuation application of U.S. Non-Provisional patent application Ser. No. 14/573,402, filed Dec. 17, 2014 and entitled “SYSTEM AND METHOD OF ANALYZING AUDIO DATA SAMPLES ASSOCIATED WITH SPEECH RECOGNITION,” the contents of which are incorporated herein by reference in their entirety.

II. FIELD

The present disclosure is generally related to speech recognition.

III. DESCRIPTION OF RELATED ART

Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless telephones such as mobile and smart phones, tablets and laptop computers that are small, lightweight, and easily carried by users. These devices can communicate voice and data packets over wireless networks. Further, many such devices incorporate additional functionality such as a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such devices can process executable instructions, including software applications, such as a web browser application, that can be used to access the Internet. As such, these devices can include significant computing capabilities.

To enable hands-free operation, mobile devices increasingly allow users to provide input via speech. Speech recognition functions can use considerable processing power. Accordingly, a mobile device may turn off certain speech recognition capabilities (e.g., circuits or subsystems used for speech recognition) while the mobile device is in a low power state (e.g., a sleep or standby state) in order to conserve power. The mobile device may include a speech detector that is operational when the mobile device is in the low power state. The speech detector may activate the speech recognition capabilities (e.g., by waking up the circuits or subsystems used for speech recognition) when speech is detected. For example, when the speech detector detects user speech (or a user utterance corresponding to a keyword), the speech detector may activate the speech recognition capabilities. When the speech recognition capabilities are ready to receive input, the mobile device may provide a prompt to the user indicating that the mobile device is ready to receive a command. Because of the time it takes to prepare the speech recognition capabilities to receive input, there can be a delay between when the user speaks to wake up the mobile device and when the user is able to provide the command.

IV. SUMMARY

It may be more convenient to the user to awaken a mobile device and provide a command to the mobile device using a single sentence or utterance (e.g., without waiting for a prompt from the mobile device to indicate that the mobile device is ready to receive a command). For example, the user may prefer to use an ordinary sentence structure to address the mobile device (e.g., using a keyword) and to state the command or interrogatory (e.g., a command phrase). To illustrate, the user may state “Device, what is the weather like outside?” without unusual pauses.

In order to accommodate such a keyword/command phrase input (e.g., a sentence) when the speech recognition capabilities of the mobile device are in a low power state, the mobile device may buffer (e.g., save in a memory device) audio data corresponding to the speech and subsequently process the audio data when the speech recognition capabilities are ready. The speech recognition capabilities may be provided by or may correspond to a processor that executes a speech recognition engine. Thus, making the speech recognition capabilities ready to process speech may include transitioning the processor from a low power state (e.g., a sleep state or standby state) to a higher power state (e.g., a ready state) and loading instructions corresponding to the speech recognition engine to the processor.

In a particular aspect, a mobile device may include a coder/decoder (CODEC) that includes a keyword detector (e.g., a digital signal processor executing instructions to detect a keyword (or a set of keywords) in user speech). The mobile device may also include an application processor configured to execute a speech recognition engine. The keyword detector may be a relatively low power device as compared to the application processor. Thus, when the mobile device is in a low power state, the keyword detector may remain active and the application processor may be inactive (e.g., in a sleep or standby state). The CODEC may also include a first buffer. When the keyword detector detects a keyword in audio data corresponding to an utterance from the user, the CODEC may buffer the audio data at the first buffer. Additionally, in response to detecting the keyword, the CODEC may send an indication to the application processor to cause the application processor to awaken from the low power state. In response to receiving the indication from the CODEC, the application processor may activate (e.g., initialize) a bus that couples the application processor to the CODEC to enable communication between the CODEC and the application processor. Additionally, the application processor may begin initializing the speech recognition engine (e.g., loading instructions corresponding to the speech recognition engine from a memory).

The CODEC may continue to buffer the audio data until the bus between the application processor and the CODEC is active. When the bus is ready, the audio data at the first buffer may be transferred via the bus to a second buffer at the application processor. The first buffer may have less storage capacity than the second buffer (e.g., to reduce cost associated with the first buffer). For example, while the second buffer may be capable of storing audio data associated with an entire command phrase, the first buffer may not have sufficient capacity to store the entire command phrase. After the keyword is detected and while the speech recognition engine is being prepared, audio data (e.g., portions of the command phrase) may continue to be received at the mobile device. The audio data may be buffered at the first buffer and transferred, in a first-in-first-out manner, from the first buffer to the second buffer via the bus. Thus, the first buffer need not be large enough to store the entire command phrase. Rather, it is sufficient for the first buffer to be large enough to store audio data that is received during a time for the bus to be initialized.

The second buffer may continue to receive and buffer audio data while the speech recognition engine is prepared for execution. When the speech recognition engine is ready, the speech recognition engine may access the audio data from the second buffer to perform speech recognition to determine whether the audio data includes a command phrase. When the audio data includes a command phrase, the application processor may cause an action corresponding to the command phrase to be performed. For example, the application processor may cause an application associated with the command phrase to be executed or may provide input to the application based on the command phrase.

In a particular aspect, an apparatus includes a first buffer configured to store multiple audio data samples and a second buffer configured to store the multiple audio data samples. The first buffer is coupled to a first processor that is configured to analyze audio data samples to detect a keyword. The second buffer is coupled to a second processor that is configured to initialize a speech recognition engine (SRE) based on the multiple audio data samples. The first buffer has less storage capacity than the second buffer.

In another particular aspect, an apparatus includes a coder/decoder (CODEC) including a first processor and a first buffer. The first processor is configured to analyze audio data samples to detect a keyword, and the CODEC is configured to store a set of audio data samples at the first buffer. The apparatus also includes an application processor configured to receive the set of audio data samples from the CODEC via a bus and configured to initialize a speech recognition engine (SRE) based on the set of audio data samples. The application processor is configured to initialize the bus based on an indication from the CODEC that the keyword is detected.

In another particular aspect, a method includes obtaining audio data samples at a first processor and analyzing the audio data samples to detect a keyword. The method also includes, after detecting the keyword, storing a set of audio data samples at a first buffer of a CODEC and sending an indication of detection of the keyword to an application processor. The application processor is configured to initialize a bus to enable communication between the CODEC and the application processor based on the indication from the CODEC. The method also includes, after the bus is initialized, sending the set of audio data samples to the application processor to perform speech recognition.

In another particular aspect, a computer-readable storage device stores instructions that are executable by a processor of a coder/decoder (CODEC) to cause the processor to perform operations including analyzing audio data samples to detect a keyword. The operations also include, after detecting the keyword, storing a set of audio data samples at a first buffer of the CODEC and sending an indication of detection of the keyword to an application processor. The application processor is configured to initialize a bus to enable communications between the CODEC and the application processor based on the indication from the CODEC. The operations also include, after the bus is initialized, sending the set of audio data samples to the application processor to perform speech recognition.

One particular advantage provided by at least one of the disclosed embodiments is that buffering audio data at the CODEC before providing the audio data to a second buffer at the application processor, as described herein, allows the user to conveniently provide a keyword/command phrase sentence (without waiting for the mobile device to wake up and provide a prompt). Additionally, cost of the mobile device is not significantly increased by this arrangement because a relatively low cost buffer can be used at the CODEC since the buffer at the CODEC does not need to be large enough to store the entire command phrase.

Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and the Claims.

V. BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating a particular embodiment of a system 100 that includes a device 102 that is capable of receiving commands via speech;

FIG. 2 is a diagram illustrating a particular embodiment of a first stage during interaction between a CODEC and an application processor of the device of FIG. 1;

FIG. 3 is a diagram illustrating a particular embodiment of a second stage during interaction between the CODEC and the application processor of the device of FIG. 1;

FIG. 4 is a diagram illustrating a particular embodiment of a third stage during interaction between the CODEC and the application processor of the device of FIG. 1;

FIG. 5 is a diagram illustrating a particular embodiment of a fourth stage during interaction between the CODEC and the application processor of the device of FIG. 1;

FIG. 6 is a diagram illustrating a particular embodiment of a fifth stage during interaction between the CODEC and the application processor of the device of FIG. 1;

FIG. 7 is a diagram illustrating a particular embodiment of a sixth stage during interaction between the CODEC and the application processor of the device of FIG. 1;

FIG. 8 is a flowchart illustrating a particular embodiment of a method performed by the device of FIG. 1; and

FIG. 9 is a block diagram of a particular embodiment of an electronic device including the CODEC and the application processor of FIG. 1.

VI. DETAILED DESCRIPTION

FIG. 1 is a diagram illustrating a particular embodiment of a system 100 that includes a device 102 that is capable of receiving commands via speech. The device 102 may include or correspond to a mobile device, a portable telephony device, a computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, etc.), a navigation device, a wearable computing device, an in-vehicle computing device (such as a driver assistance device), or another device configured to receive commands via speech. The device 102 may include an audio transducer, such as a microphone 108, that is capable of detecting an utterance 106 from a user 104. In some instances, the utterance 106 may include a keyword 110, a command phrase 112, or both. For example, the user 104 may speak the keyword 110 followed by the command phrase 112 (e.g., as a keyword/command phrase sentence). The device 102 may be configured to receive the keyword 110 and the command phrase 112 without requiring that the user 104 wait for a prompt between the keyword 110 and the command phrase 112.

The device 102 includes a coder/decoder (CODEC) 120 coupled to the microphone 108 and an application processor 150 coupled to the CODEC 120 via a bus 140. For example, the CODEC 120 may include a first bus interface 138 coupled to the bus 140, and the application processor 150 may include a second bus interface 152 coupled to the bus 140. The device 102 may also include a memory 160 that stores instructions 162. The instructions 162 may be executable by the application processor 150 or by another processor (not shown) of the device 102. For example, the instructions 162 may correspond to an application that is executable based on a command in the command phrase 112. To illustrate, the application may be a navigation or map program that is executed in response to the command phrase “where am I?” (or another navigation or location related command phrase). As another example, the application may be executable to receive input based on the command phrase 112. To illustrate, the application may be a search application that receives a search query responsive to the command phrase 112. In yet another example, the application may be executed and may receive input based on the command phrase 112. To illustrate, in response to the command phrase “call mom,” a communication application may be started and input indicating a type of communication (e.g., a call) or a destination of the communication (e.g., a telephone number associated with a contact identified as “mom”) may be provided to the communication application.

In a particular embodiment, keyword detection and command phrase recognition functions are divided in the device 102 between the CODEC 120 and the application processor 150, respectively. For example, the CODEC 120 may include a processor configured to execute a keyword detector 130. The keyword detector 130 is configured to analyze audio data samples to detect keywords, such as the keyword 110. The application processor 150 may be configured to initialize and execute a speech recognition engine (SRE) 154 to identify a command phrase, such as the command phrase 112, in an utterance 106 by the user.

In the embodiment illustrated in FIG. 1, the CODEC 120 includes an audio detector 122. The audio detector 122 may be configured to receive audio data 111 from the microphone 108 or from other circuitry (not shown), such as an analog-to-digital converter (ADC) or other circuitry coupled to the microphone 108. The audio detector 122 may be configured to determine whether an acoustic signal received at the microphone 108 and represented by the audio data 111 satisfies a threshold. For example, the audio detector 122 may determine whether the audio data 111 satisfies a threshold. To illustrate, the threshold may be a volume threshold, and the audio detector 122 may determine whether the audio data 111 is sufficiently loud to indicate that speech may be present in the acoustic signal. As another illustrative example, the threshold may be a frequency threshold, and the audio detector 122 may determine whether the audio data 111 is within a frequency range corresponding to human speech.
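As an illustrative, non-limiting sketch, the volume threshold check described above may be expressed as follows. The use of a root-mean-square (RMS) level as the loudness measure, the function names `rms_level` and `satisfies_volume_threshold`, and the numeric threshold are assumptions made for illustration and are not taken from the disclosure.

```python
import math


def rms_level(samples):
    """Return the root-mean-square amplitude of a frame of PCM samples.

    RMS is an assumed loudness measure; the disclosure only refers to a
    "volume threshold" without naming a specific measure.
    """
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def satisfies_volume_threshold(samples, threshold):
    """Return True when the frame is loud enough that speech may be present."""
    return rms_level(samples) >= threshold


# Hypothetical frames and threshold, chosen only to exercise both outcomes.
quiet_frame = [1, -1, 0, 0]
loud_frame = [3, -3, 3, -3]
print(satisfies_volume_threshold(quiet_frame, 2.5))
print(satisfies_volume_threshold(loud_frame, 2.5))
```

A frame that satisfies the threshold would be forwarded for keyword detection, while a frame that does not would be discarded without waking any additional circuitry.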

When the audio detector 122 determines that a portion of the audio data 111 is of interest (e.g., may include speech), the portion of the audio data 111 may be provided to the keyword detector 130 as audio data samples 115. In a particular embodiment, the portion of the audio data 111 is provided to the keyword detector 130 via a plurality of buffers (e.g., alternating buffers 124). In this embodiment, the alternating buffers 124 include a first alternating buffer 126 and a second alternating buffer 128. When one of the alternating buffers (e.g., the second alternating buffer 128) is in a receive mode, the other alternating buffer (e.g., the first alternating buffer 126) is in a send mode. The alternating buffer that is in the receive mode (e.g., the second alternating buffer 128) may receive and store audio data from the microphone 108 (e.g., a portion of the audio data 111). The alternating buffer that is in the send mode (e.g., the first alternating buffer 126) may send audio data 113 stored at the alternating buffer to the keyword detector 130 (e.g., as a portion of the audio data samples 115). When the alternating buffer in the receive mode is full, the alternating buffers switch roles (e.g., the alternating buffer in the receive mode switches to the send mode, and the alternating buffer in the send mode switches to the receive mode). In other embodiments, the alternating buffers 124 are not used and other buffering mechanisms are used to allow the keyword detector 130 to receive and analyze the audio data samples 115.
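The role switching of the alternating buffers described above may be sketched as follows. This is a minimal, non-limiting illustration; the class name `AlternatingBuffers`, the `write` method, and the per-buffer capacity are hypothetical and are not part of the disclosure.

```python
class AlternatingBuffers:
    """Two fixed-capacity buffers that swap receive and send roles."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = [[], []]
        self.receive_index = 0  # index of the buffer currently in receive mode

    def write(self, sample):
        """Store one sample in the receive-mode buffer.

        Returns the contents of the buffer that just became full (and has
        therefore switched to send mode), or None if no switch occurred.
        """
        receiving = self.buffers[self.receive_index]
        receiving.append(sample)
        if len(receiving) < self.capacity:
            return None
        # The receive-mode buffer is full: switch roles. The former
        # send-mode buffer is emptied so it can receive new samples.
        self.receive_index = 1 - self.receive_index
        self.buffers[self.receive_index].clear()
        return list(receiving)


# Hypothetical usage: seven samples written into buffers of capacity three.
pair = AlternatingBuffers(capacity=3)
for sample in range(7):
    block = pair.write(sample)
    if block is not None:
        print("send to keyword detector:", block)
```

In this sketch, each returned block stands in for the audio data sent from the send-mode buffer to the keyword detector while the other buffer continues to receive.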

The audio data samples 115 may correspond to portions of the audio data 111 that may include speech based on the determination by the audio detector 122. The keyword detector 130 may process the audio data samples 115 to detect a keyword, such as the keyword 110. For example, the keyword detector 130 may compare the audio data samples 115 (or features extracted from the audio data samples 115) to a keyword model from a memory. In another example, the keyword detector 130 may analyze the audio data samples 115 (or features extracted from the audio data samples 115) using a temporal pattern recognition process, such as a Markov chain model, a hidden Markov model, a semi-Markov model, or a combination thereof. The memory may correspond to a first buffer 132 or may correspond to a different memory (not shown).

When the keyword detector 130 detects a keyword, the keyword detector 130 may provide a signal to an interrupt controller 134. The interrupt controller 134 may provide an indication 136 to the application processor 150 (e.g., via a connection other than the bus 140). Prior to receiving the indication 136, the application processor 150 and the bus 140 may be in a low power state (e.g., a sleep or standby state). In response to the indication 136, the application processor 150 may begin initializing the bus 140. Additionally, the application processor 150 may begin initialization of the speech recognition engine 154.

After detecting the keyword in the audio data samples 115, the keyword detector 130 may cause a set of audio data samples 117 to be stored at the first buffer 132. The set of audio data samples 117 may include all of the audio data samples 115 or may correspond to a subset of the audio data samples 115. For example, the set of audio data samples 117 may exclude audio data samples corresponding to the keyword. As another example, the keyword may be detected by the keyword detector 130 before all of the audio data samples corresponding to the keyword have been analyzed. To illustrate, as each audio data sample is analyzed, the keyword detector 130 may update a confidence value indicating a confidence that a keyword is present in the audio data samples 115. When the confidence satisfies a threshold, the keyword detector 130 may determine that the keyword is present. In this example, the set of audio data samples 117 may include audio data samples that are received after the keyword detector 130 determines that the keyword is present. Thus, the set of audio data samples 117 may include a portion of a keyword. In this example, the keyword detector 130 may identify a last audio data sample of the keyword in the set of audio data samples 117.
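The incremental confidence update described above may be sketched as follows. This is a non-limiting illustration only: the exponential smoothing rule, the per-frame match scores, the function name `detect_keyword`, and the parameter `alpha` are assumptions, and a practical keyword detector may instead use a keyword model or a hidden Markov model as noted elsewhere herein.

```python
def detect_keyword(frame_scores, threshold, alpha=0.5):
    """Return the index at which the running confidence first satisfies
    the threshold, or None if it never does.

    frame_scores: hypothetical per-frame match scores in the range [0, 1].
    alpha: assumed smoothing factor for the running confidence.
    """
    confidence = 0.0
    for index, score in enumerate(frame_scores):
        # Update the confidence as each new frame is analyzed.
        confidence = alpha * score + (1.0 - alpha) * confidence
        if confidence >= threshold:
            return index
    return None


# With these hypothetical scores the confidence evolves as
# 0.0, 0.5, 0.75, 0.875, so detection occurs at index 3, before the
# remaining frames of the keyword have been analyzed.
scores = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(detect_keyword(scores, threshold=0.8))
```

Because detection can occur before the final frame of the keyword arrives, frames buffered after detection may still contain the tail of the keyword, which is why the last audio data sample of the keyword is identified separately.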

In a particular embodiment, establishing a communication connection 170 via the bus 140 is not instantaneous. For example, initializing the bus 140 and preparing the bus 140 for communication between the CODEC 120 and the application processor 150 may take on the order of tens to hundreds of milliseconds (e.g., 10 milliseconds to 1000 milliseconds). Thus, additional data may be received from the microphone 108 after the indication 136 is sent and before the bus 140 is ready. The additional data may be stored at the first buffer 132 (e.g., as part of the set of audio data samples 117) until the bus 140 is ready.

When the bus 140 is ready (e.g., when the communication connection 170 is established between the CODEC 120 and the application processor 150), at least a portion 119 of the set of audio data samples 117 may be transferred over the communication connection 170 to a second buffer 156 of the application processor 150. In a particular embodiment, all of the audio data samples of the set of audio data samples 117 are transferred to the second buffer 156. In another embodiment, a subset of the set of audio data samples 117 is transferred. For example, as explained above, the set of audio data samples 117 may include data corresponding to at least a portion of the keyword 110. In this example, the portion 119 of the set of audio data samples 117 transferred to the second buffer 156 may exclude the portion of the keyword 110. For example, the keyword detector 130 may determine the final audio data sample of the keyword 110, and the next audio data sample may be used as a starting point for transferring of the portion 119 of the set of audio data samples 117 to the second buffer 156.
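Selecting the starting point for the transfer, as described above, may be sketched as follows. This is a non-limiting illustration; the function name `portion_to_transfer` and the representation of the buffered samples as a list are hypothetical.

```python
def portion_to_transfer(buffered_samples, last_keyword_index):
    """Return the buffered samples that follow the last keyword sample.

    last_keyword_index: position of the final keyword sample within
    buffered_samples, or None when no keyword samples were buffered
    (in which case every buffered sample is transferred).
    """
    if last_keyword_index is None:
        return list(buffered_samples)
    # The sample after the last keyword sample is the starting point.
    return list(buffered_samples[last_keyword_index + 1:])


# Hypothetical buffer contents: two keyword samples, then command phrase samples.
buffered = ["kw-1", "kw-2", "cp-1", "cp-2", "cp-3"]
print(portion_to_transfer(buffered, last_keyword_index=1))
```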

As additional audio data samples are received from the microphone 108, the additional data samples may be buffered at the first buffer 132 and subsequently transferred (in a first-in-first-out manner) to the second buffer 156. On average (e.g., over a second), the audio data samples buffered at the first buffer 132 may be sent to the second buffer 156 at the same rate that new audio data samples are received at the first buffer 132. Thus, the first buffer 132 does not need to store the entire command phrase 112. Rather, it is sufficient for the first buffer 132 to have capacity to store audio data samples that may be received during a time period between a first time at detection of the keyword 110 and a second time when the bus 140 is ready. For example, the first buffer 132 may be sized (e.g., have capacity) to store approximately 250 milliseconds worth of audio data samples at an audio sampling rate of 16 samples per millisecond.
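The sizing example above corresponds to 250 × 16 = 4000 audio data samples. The arithmetic may be sketched as follows; the function name `first_buffer_capacity` and the width of two bytes per sample (e.g., 16-bit samples) are assumptions made for illustration and are not stated in the disclosure.

```python
def first_buffer_capacity(duration_ms, samples_per_ms, bytes_per_sample=2):
    """Return (sample_count, byte_count) for a buffer that holds
    duration_ms worth of audio at the given sampling rate.

    bytes_per_sample is an assumed sample width, not a disclosed value.
    """
    sample_count = duration_ms * samples_per_ms
    return sample_count, sample_count * bytes_per_sample


# 250 milliseconds at 16 samples per millisecond, as in the example above.
samples, size_bytes = first_buffer_capacity(250, 16)
print(samples, size_bytes)  # 4000 samples; 8000 bytes under the assumed width
```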

The second buffer 156 may continue to store audio data samples 119 until the speech recognition engine 154 is ready. When the speech recognition engine 154 is ready, the speech recognition engine 154 may access the second buffer 156 and perform speech recognition on the audio data samples 119 from the second buffer 156. For example, the speech recognition engine 154 may determine whether the utterance 106 includes a command phrase, such as the command phrase 112. To illustrate, the speech recognition engine 154 may analyze the audio data samples 119 (or features extracted from the audio data samples 119) using a temporal pattern recognition process, such as a Markov chain model, a hidden Markov model, a semi-Markov model, or a combination thereof. When the speech recognition engine 154 detects and recognizes the command phrase 112, the application processor 150 may determine whether the command phrase 112 is mapped to a particular action (e.g., based on mapping information (not shown) in the memory 160). If the command phrase 112 is mapped to a particular action, the application processor 150 may initiate the particular action responsive to the command phrase 112. For example, the application processor 150 may cause the instructions 162 from the memory 160 to be executed to provide a service (e.g., display a map) or a response (e.g., provide directions) based on the command phrase 112. Thus, the system 100 enables the user 104 to conveniently provide a keyword/command phrase sentence without waiting for the device 102 to wake up and provide a prompt.

FIGS. 2-7 illustrate stages of interaction between the CODEC 120 and the application processor 150 of FIG. 1. FIG. 2 illustrates a particular embodiment of a first stage during interaction between the CODEC 120 and the application processor 150. In the first stage, no keyword 110 has been detected by the keyword detector 130. For example, other sounds 210 may be received by the microphone 108 before the keyword 110 is received at the microphone 108. When the other sounds 210 are received, the audio detector 122 of FIG. 1 may indicate that no speech is present (e.g., the other sounds 210 fail to satisfy a threshold). Alternately, the audio detector 122 may determine that the other sounds 210 satisfy the threshold and may provide audio data to the keyword detector 130; however, the keyword detector 130 may determine that the audio data corresponding to the other sounds 210 does not include the keyword 110. During the first stage illustrated in FIG. 2 (e.g., before the keyword 110 is detected), the application processor 150 and the bus 140 may be in a low power state, such as a sleep state or standby state, in order to conserve power.

FIG. 3 illustrates a particular embodiment of a second stage during interaction between the CODEC 120 and the application processor 150. In the second stage, the keyword 110 is detected by the keyword detector 130. For example, the keyword detector 130 may analyze received audio data samples and may determine a confidence value indicating a likelihood (based on the received audio data samples) that the keyword is present in the utterance 106. When the confidence value satisfies a threshold (e.g., indicates a relatively high probability that the utterance 106 includes the keyword 110), the keyword detector 130 may determine that the keyword 110 is present. Thus, for example, the keyword detector 130 may detect the keyword 110 based on only a portion of the keyword 110.

When the keyword 110 is detected, the keyword detector 130 may cause audio data samples to be stored at the first buffer 132. For example, as illustrated in FIG. 3, audio data samples corresponding to a start 310 of the keyword 110 may be stored at the first buffer 132 while an end 312 of the keyword 110 is still being received at the microphone 108. Thus, in a particular embodiment, buffering of audio data samples at the first buffer 132 may begin before the command phrase 112 is received at the microphone 108.

Additionally, when the keyword 110 is detected, the keyword detector 130 may send an indication 330 of detection of the keyword 110 to the application processor 150 (e.g., via a connection (not shown) between the CODEC 120 and the application processor 150 that is distinct from the bus 140). In response to the indication 330, the application processor 150 may begin transitioning from a low power state to a higher power state (e.g., a ready state). For example, the application processor 150 may begin initializing the bus 140. Additionally, the application processor 150 may begin loading instructions corresponding to the speech recognition engine 154.

FIG. 4 illustrates a particular embodiment of a third stage during interaction between the CODEC 120 and the application processor 150. In the third stage, the keyword 110 has been detected but the bus 140 is not ready. Thus, the third stage corresponds to a time period for the bus 140 to transition from the low power state to a ready state, which may be more than 10 milliseconds. Since the microphone 108 may sample the acoustic signal multiple times during the time period for the bus 140 to be readied, additional samples may be received at the CODEC 120 after detection of the keyword 110 and before the bus 140 is ready. The additional audio data samples may be buffered at the first buffer 132. Additionally, during the third stage, the keyword detector 130 may identify a last audio data sample of the keyword 110 in the first buffer 132. To illustrate, in the example illustrated in FIG. 4, the first buffer 132 includes several audio data samples corresponding to the keyword 110 and several audio data samples corresponding to the command phrase (CP) 112 or other audio data samples that are subsequent to the last audio data sample of the keyword 110. The keyword detector 130 may determine which audio data sample in the first buffer 132 is the last audio data sample of the keyword 110.

FIG. 5 illustrates a particular embodiment of a fourth stage during interaction between the CODEC 120 and the application processor 150. In the fourth stage, the bus 140 is ready (e.g., the communication connection 170 is available between the CODEC 120 and the application processor 150). When the bus 140 is ready, a set of audio data samples following the last audio data sample of the keyword 110 may be transferred via the bus 140 from the first buffer 132 to the second buffer 156. Audio data samples that correspond to the keyword 110 may be flushed from the first buffer 132. In another embodiment, the audio data samples corresponding to the keyword 110 may be transferred to the second buffer 156 via the bus 140 and may be omitted by the speech recognition engine 154 from processing.

FIG. 6 illustrates a particular embodiment of a fifth stage during interaction between the CODEC 120 and the application processor 150. In the fifth stage, the audio data samples stored at the first buffer 132 when the bus 140 was ready (e.g., at the fourth stage) have all been sent to the second buffer 156. The speech recognition engine 154 may take longer to prepare than the bus 140. For example, the speech recognition engine 154 may take more than one second to be ready. Accordingly, additional audio data samples may be received after the bus 140 is ready and before the speech recognition engine 154 is ready. The additional audio data samples may be buffered at the first buffer 132 and subsequently transferred to the second buffer 156 while the speech recognition engine 154 is being prepared.

In some instances, the command phrase 112 may be longer than can be stored at the first buffer 132 based on a capacity of the first buffer 132. Accordingly, the first buffer 132 may receive audio data samples from the microphone 108 at approximately the same average rate that it sends audio data samples to the second buffer 156. In another example, the first buffer 132 may send audio data samples to the second buffer 156 at a rate that is greater than a rate that the audio data samples are received by the first buffer 132. The second buffer 156 may be considerably larger than the first buffer 132. Accordingly, the second buffer 156 may have sufficient capacity to store the entire command phrase 112 while the speech recognition engine 154 is being prepared.
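The relationship between the fill rate and the drain rate of the first buffer described above may be sketched with a small simulation. This is a non-limiting illustration under simplifying assumptions: time advances in one-millisecond steps, the bus becomes ready at a fixed time, and the function name `simulate_transfer` and all rates and durations are hypothetical.

```python
from collections import deque


def simulate_transfer(total_ms, bus_ready_ms, in_rate, out_rate):
    """Simulate first-in-first-out transfer from a first buffer to a second.

    Returns (peak_first, remaining_first, second), where peak_first is the
    largest number of samples ever held in the first buffer.
    """
    first, second = deque(), []
    peak_first = 0
    for t in range(total_ms):
        if t >= bus_ready_ms:
            # Bus is ready: move up to out_rate samples, oldest first.
            for _ in range(min(out_rate, len(first))):
                second.append(first.popleft())
        # in_rate new samples arrive from the microphone during this step.
        first.extend(range(t * in_rate, (t + 1) * in_rate))
        peak_first = max(peak_first, len(first))
    return peak_first, len(first), second


# Hypothetical case: bus ready after 250 ms, 16 samples per ms in and out.
peak, remaining, second = simulate_transfer(1000, 250, 16, 16)
print(peak, remaining, len(second))
```

Under these assumptions, the peak occupancy of the first buffer is bounded by the samples received while the bus is being readied, whereas the second buffer grows for as long as the utterance continues, which is consistent with the second buffer being considerably larger than the first.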

FIG. 7 illustrates a particular embodiment of a sixth stage during interaction between the CODEC 120 and the application processor 150. In the sixth stage, the speech recognition engine 154 is ready. When the speech recognition engine 154 is loaded and ready to execute, the speech recognition engine 154 may access audio data samples from the second buffer 156. The speech recognition engine 154 may perform speech recognition on the audio data samples from the second buffer 156, and may initiate other actions based on the detected command phrase 112. For example, the speech recognition engine 154 may cause instructions corresponding to another application to be executed at the application processor 150. In some embodiments, when the command phrase is particularly long, audio data samples may continue to be provided from the first buffer 132 to the second buffer 156 after the speech recognition engine 154 is prepared. The SRE 154 may process the audio data samples from the second buffer 156 in real time (e.g., at approximately the same average rate that the audio data samples are received at the second buffer 156) or may process the audio data samples from the second buffer 156 at a rate that is faster than in real time (e.g., at an average rate that is greater than a rate at which the audio data samples are received at the second buffer 156). When the SRE 154 processes the audio data samples faster than real time, overall latency of recognizing commands and taking corresponding actions can be reduced since delay associated with the SRE 154 is low. Thus, the CODEC 120 and application processor 150 enable a user to conveniently provide a keyword/command phrase sentence without waiting for the application processor 150 to wake up and provide a prompt.

When the speech recognition engine 154 has performed speech recognition to identify the command phrase 112 (e.g., text corresponding to the command phrase 112), the application processor 150 may determine an action to be performed responsive to the command phrase. For example, the memory 160 may include data mapping particular command phrases to corresponding actions. After determining an action corresponding to the command phrase 112, the application processor 150 may cause the action(s) to be performed. If no additional input is received (e.g., there is no activity), the application processor 150 may subsequently return to a low power state. For example, the application processor 150 and the bus 140 may be transitioned back to the low power state to await additional input.
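One way such a mapping of command phrases to actions might be represented is sketched below. The phrases, the action functions, and the normalization step (trimming whitespace and lowercasing) are hypothetical and are shown only to illustrate a lookup from recognized text to an action.

```python
def play_music():
    # Hypothetical action; a real device would start an application.
    return "playing music"


def call_home():
    # Hypothetical action; a real device would place a call.
    return "calling home"


# Hypothetical data mapping command phrases to corresponding actions.
COMMAND_ACTIONS = {
    "play music": play_music,
    "call home": call_home,
}


def dispatch(recognized_text):
    """Look up the recognized text and run the mapped action, if any."""
    action = COMMAND_ACTIONS.get(recognized_text.strip().lower())
    if action is None:
        return None  # no mapped action; the device may return to low power
    return action()
```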

Although FIGS. 1-7 have been described in terms of receiving and processing audio data, the CODEC 120 and application processor 150 may be used in a similar manner to process other data. For example, rather than, or in addition to, receiving audio data samples, the CODEC 120 may receive image frames (e.g., video data samples). In this example, the audio detector 122 may be replaced with or supplemented with an image pattern detector (e.g., a light detector, an edge detector, a color detector, a motion detector, or another relatively simple, fast detector) that can screen image data substantially in real time to detect an image frame or a set of image frames that may be of interest. Additionally, in this example, the keyword detector 130 may be replaced with or supplemented with an image processing device, such as a face detector or an object detector. The image processing device may determine whether an object of interest is present in the image frame or the set of image frames and may cause image frames to be buffered at the first buffer 132 while a more complex processing system, such as a video processing system, is prepared for execution by the application processor 150. When the bus 140 is ready, the image frames may be transferred (in a first-in-first-out manner, as described above) from the first buffer 132 to the second buffer 156 for processing by the video processing system. The video processing system may analyze the image frames from the second buffer 156 to detect, for example, a gesture, a facial expression, or another visual cue. In this example, the video processing system may cause an action to be performed based on the analysis of the image frames. To illustrate, the system in this example may be used in a gesture recognition system. Thus, a gesture by a user may be detected and an action corresponding to the gesture may be performed.

In addition to, or instead of, audio data and image data, the CODEC 120 and application processor 150 may be used to process other data, such as data from a sensor, in circumstances which may benefit from keeping the application processor 150 in a low power state until relevant data is received, then buffering the data while the application processor 150 is readied to analyze the data.

FIG. 8 is a flow chart of a particular embodiment of a method 800 of performing speech recognition. The method 800 may be performed by the device 102 of FIG. 1. The method 800 includes, at 802, sampling and digitizing acoustic signals received at an audio transducer. For example, the audio transducer may include or correspond to the microphone 108 of FIG. 1. The microphone 108 of FIG. 1 and other circuitry, such as an ADC, may sample and digitize acoustic signals, such as the utterance 106, to generate the audio data 111.

The method 800 may also include, at 804, obtaining audio data samples at a keyword detector of a CODEC. The audio data samples may correspond to portions of the sampled and digitized acoustic signals that satisfy a threshold. For example, the audio data 111 may be provided to the audio detector 122 of FIG. 1. The audio detector 122 may determine whether any portion of the audio data 111 satisfies a threshold that indicates that speech may be present in the audio data 111. When the audio detector 122 determines that a portion of the audio data 111 may include speech, the portion of the audio data 111 may be provided to the keyword detector 130 as the audio data samples 115.
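The threshold test is not tied to any particular measure in the description above. As a minimal sketch, assuming the threshold is applied to the mean energy of fixed-size frames (an assumption made only for illustration), the screening might look as follows. The function names, the frame size, and the threshold value are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def frames_passing_threshold(samples, frame_size, energy_threshold):
    """Return the complete frames whose mean energy meets the threshold.

    Only frames that pass would be forwarded for keyword analysis;
    a trailing partial frame is ignored in this sketch.
    """
    passed = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        if frame_energy(frame) >= energy_threshold:
            passed.append(frame)
    return passed
```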

The method 800 also includes, at 806, analyzing the audio data samples using the keyword detector to detect a keyword. For example, the keyword detector 130 may analyze the audio data samples 115 to determine whether the audio data samples 115 include a keyword. The method 800 may also include, at 808, after detecting the keyword, storing a set of audio data samples at a first buffer of the CODEC. For example, after the keyword detector 130 detects the keyword 110 based on the audio data samples 115, the keyword detector 130 may cause the first buffer 132 to store a set of audio data samples 117. The set of audio data samples 117 may include all of the audio data samples 115 or may include a subset of the audio data samples 115, such as those portions of the audio data samples 115 received after the keyword 110 was detected.

In a particular embodiment, the method 800 includes, at 810, determining a final keyword audio data sample corresponding to an end of the keyword. For example, the keyword detector 130 may analyze the set of audio data samples 117 to determine which audio data sample of the set of audio data samples 117 corresponds to the last audio data sample of the keyword 110. In other embodiments, the method 800 does not include determining the final keyword audio data sample.

After detecting the keyword, the method 800 may also include, at 812, sending an indication of detection of the keyword to the application processor. The application processor may be configured to initialize a bus to enable communication between the CODEC and the application processor based on the indication from the CODEC. For example, when the keyword detector 130 detects the keyword 110 based on the audio data samples 115, the keyword detector 130 may cause the indication 136 to be transmitted to the application processor 150. In response to receiving the indication 136, the application processor 150 may initialize the bus 140. Additionally, the application processor 150 may begin preparing the speech recognition engine 154. For example, the application processor 150 may access instructions from the memory 160 and may load instructions corresponding to the speech recognition engine 154 to working memory of the application processor 150. When the bus 140 is prepared (e.g., after the bus is initialized), the bus interface 152 of the application processor 150 may provide a signal to the bus interface 138 of the CODEC 120. The signal may indicate that the bus 140 is ready. While the bus 140 is being prepared, audio data samples received from the microphone 108 may continue to be stored at the first buffer 132.
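The ordering of events described above (indication, bus initialization, bus-ready signal, continued buffering) can be sketched with two simplified objects. The class names, method names, and state strings below are hypothetical stand-ins, not an implementation of the CODEC 120 or the application processor 150, and the sketch omits timing, interrupts, and hardware details.

```python
class Codec:
    """Hypothetical stand-in for a CODEC with a first buffer."""

    def __init__(self):
        self.first_buffer = []
        self.bus_ready = False

    def store_sample(self, sample):
        # Samples keep accumulating whether or not the bus is ready.
        self.first_buffer.append(sample)

    def on_bus_ready(self):
        # Corresponds to receiving the signal that the bus is ready.
        self.bus_ready = True

    def drain(self):
        """Hand over buffered samples, oldest first, once the bus is ready."""
        if not self.bus_ready:
            return []
        drained, self.first_buffer = self.first_buffer, []
        return drained


class ApplicationProcessor:
    """Hypothetical stand-in for an application processor."""

    def __init__(self):
        self.state = "low_power"
        self.sre_loaded = False
        self.second_buffer = []

    def on_keyword_indication(self, codec):
        # Wake up, initialize the bus, then tell the CODEC it is ready.
        self.state = "active"
        codec.on_bus_ready()

    def load_sre(self):
        # Loading the engine may finish well after the bus is ready.
        self.sre_loaded = True

    def receive(self, samples):
        self.second_buffer.extend(samples)
```

In this sketch, samples stored before the bus-ready signal remain in the first buffer and are handed over in arrival order afterward, and the transfer does not wait for the engine to finish loading.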

After the bus is initialized, the method 800 may include, at 814, sending the set of audio data samples (e.g., from the first buffer via the bus) to the application processor to perform speech recognition. In embodiments that include determining the final keyword audio data sample, at 810, the audio data samples may be sent via the bus beginning after the final keyword audio data sample. For example, the first buffer 132 may include the set of audio data samples 117. After a communication channel 170 is available, via the bus 140, to the second buffer 156, the set of audio data samples 119 may be transferred from the first buffer 132 to the second buffer 156. The set of audio data samples 119 may include audio data samples received after the final keyword audio data sample. In embodiments that do not include determining the final keyword audio data sample, at 810, all of the audio data samples received at the first buffer or a set of the audio data samples received at the first buffer after the keyword is detected may be sent to the application processor. The set of audio data samples 119 may include audio data samples that were not in the first buffer 132 when the communication channel 170 became available. For example, the microphone 108 may continue to receive acoustic signals corresponding to the utterance 106 and may generate additional audio data samples corresponding to the acoustic signals. The additional audio data samples may be stored at the first buffer 132. The first buffer 132 may act as a first-in-first-out buffer to receive the additional audio data samples and to transfer the additional audio data samples via the bus 140 to the second buffer 156 while the speech recognition engine 154 is being prepared. After the speech recognition engine 154 is prepared, the speech recognition engine 154 may access the second buffer 156 to perform speech recognition based on the audio data samples stored at the second buffer 156.
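The two alternatives described above (sending only the samples after the final keyword audio data sample, or sending everything buffered) can be sketched as a simple selection. The function name and the use of a list index to mark the final keyword audio data sample are assumptions made for illustration.

```python
def samples_to_send(buffered, final_keyword_index=None):
    """Choose which buffered samples to transfer over the bus.

    buffered: samples held in the first buffer, oldest first.
    final_keyword_index: position of the last sample of the keyword,
        or None when the end of the keyword was not determined.
    """
    if final_keyword_index is None:
        # End of keyword unknown: send everything that was buffered.
        return list(buffered)
    # End of keyword known: send only what follows the keyword.
    return list(buffered[final_keyword_index + 1:])
```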

Referring to FIG. 9, a block diagram of a particular illustrative embodiment of an electronic device is depicted and generally designated 900. The electronic device 900 may correspond to the device 102 of FIG. 1. For example, the electronic device 900 may include the CODEC 120 and the application processor 150. The electronic device 900 may include or correspond to a mobile device, a portable telephony device, a computing device (e.g., a tablet computer, a personal digital assistant, a laptop computer, etc.), a navigation device, a wearable computing device, an in-vehicle computing device (such as a driver assistance device), or another device configured to receive commands via speech.

The application processor 150 may include a digital signal processor (DSP). The application processor 150 may be coupled to a memory 932. The memory 932 may include instructions that are executable by the application processor 150, such as instructions corresponding to one or more applications 912.

The electronic device 900 may also include a display controller 926 that is coupled to the application processor 150 and to a display 928. The CODEC 120 may also be coupled to the application processor 150 via the bus 140. A speaker 936 and the microphone 108 can be coupled to the CODEC 120. In a particular embodiment, as explained above, the CODEC 120 includes the keyword detector 130 and the first buffer 132. The keyword detector 130 is configured to analyze audio data samples (received from the microphone 108) to detect a keyword. The CODEC 120 is configured to store a set of audio data samples at the first buffer 132. Additionally, the application processor 150 is configured to receive the set of audio data samples from the CODEC 120 via the bus 140 and to initialize and execute the speech recognition engine (SRE) 154 based on the set of audio data samples. The application processor 150 is also configured to initialize the bus 140 based on an indication from the CODEC 120 that the keyword is detected.

FIG. 9 also indicates that the electronic device 900 can include a wireless controller 940 coupled to the application processor 150 and to an antenna 942. In a particular embodiment, the application processor 150, the display controller 926, the memory 932, the CODEC 120, and the wireless controller 940 are included in a system-in-package or system-on-chip device 922. In a particular embodiment, an input device 930 and a power supply 944 are coupled to the system-on-chip device 922. Moreover, in a particular embodiment, as illustrated in FIG. 9, the display 928, the input device 930, the speaker 936, the microphone 108, the antenna 942, and the power supply 944 are external to the system-on-chip device 922. However, each of the display 928, the input device 930, the speaker 936, the microphone 108, the antenna 942, and the power supply 944 can be coupled to a component of the system-on-chip device 922, such as an interface or a controller.

In conjunction with the described embodiments, a system is disclosed that includes means for obtaining audio data samples and analyzing the audio data samples to detect a keyword. For example, the means for obtaining audio data samples and analyzing the audio data samples to detect a keyword may correspond to the CODEC 120 of FIGS. 1-7, the keyword detector 130 of FIGS. 1-7, one or more other devices or circuits configured to store one or more bits, or any combination thereof. The system may also include means for storing a set of audio data samples after detecting the keyword. For example, the means for storing a set of audio data samples after detecting the keyword may correspond to the CODEC 120 of FIGS. 1-7, the first buffer 132 of FIGS. 1-7, one or more other devices or circuits configured to store one or more bits, or any combination thereof. The system may also include means for sending an indication of detection of the keyword to an application processor after detecting the keyword. For example, the means for sending an indication of detection of the keyword to an application processor after detecting the keyword may correspond to the CODEC 120 of FIGS. 1-7, the keyword detector 130 of FIGS. 1-7, the interrupt controller 134 of FIG. 1, one or more other devices or circuits configured to store one or more bits, or any combination thereof. The system may also include means for sending the set of audio data samples via the bus to the application processor to perform speech recognition based on the set of audio data samples. For example, the means for sending the set of audio data samples may correspond to the CODEC 120 of FIGS. 1-7, the keyword detector 130 of FIGS. 1-7, the bus interface 138 of FIG. 1, the bus 140 of FIGS. 1-7, one or more other devices or circuits configured to store one or more bits, or any combination thereof.

Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or processor executable instructions depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transient storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.

The previous description of the disclosed embodiments is provided to enable a person skilled in the art to make or use the disclosed embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other embodiments without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.

What is claimed is:
1. An apparatus comprising: a first buffer configured to store multiple audio data samples and coupled to a first processor that is configured to analyze audio data samples to detect a keyword; and a second buffer configured to store the multiple audio data samples and coupled to a second processor that is configured to initialize a speech recognition engine (SRE) based on the multiple audio data samples, the first buffer having less storage capacity than the second buffer.